\chapter{Simulation Failure Capturing}\label{failure}

Dakota provides the capability to manage failures in simulation codes
within its system call, fork, and direct simulation interfaces (see
Section~\ref{interfaces:sim} for simulation interface descriptions).
Failure capturing consists of three operations: failure detection,
failure communication, and failure mitigation.

\section{Failure detection}\label{failure:detection}

Since the symptoms of a simulation failure are highly code and
application dependent, it is the user's responsibility to detect
failures within their \texttt{analysis\_driver},
\texttt{input\_filter}, or \texttt{output\_filter}. One popular 
example of simulation monitoring is to rely on a simulation's internal
detection of errors.  In this case, the UNIX \texttt{grep} utility can
be used within a user's driver/filter script to detect strings in
output files which indicate analysis failure. For example, the
following simple C shell script excerpt
\begin{verbatim}
    grep ERROR analysis.out > /dev/null
    if ( $status == 0 )
      echo "FAIL" > results.out
    endif
\end{verbatim}

will pass the \texttt{if} test and communicate simulation failure to
Dakota if the \texttt{grep} command finds the string \texttt{ERROR}
anywhere in the \path{analysis.out} file. The \path{/dev/null}
device file is called the ``bit bucket'' and the \texttt{grep} command
output is discarded by redirecting it to this destination. The
\texttt{\$status} shell variable contains the exit status of the last
command executed~\cite{And86}, which is the exit status of \texttt{grep}
in this case (0 if successful in finding the error string, nonzero
otherwise). For Bourne shells~\cite{Bli96}, the \texttt{\$?} shell
variable serves the same purpose as \texttt{\$status} for C shells. In
a related approach, if the return code from a simulation can be used
directly for failure detection purposes, then \texttt{\$status} or
\texttt{\$?} could be queried immediately following the simulation
execution using an \texttt{if} test like that shown above.

If the simulation code is not returning error codes or providing
direct error diagnostic information, then failure detection may
require monitoring of simulation results for sanity (e.g., is the mesh
distorting excessively?) or potentially monitoring for continued
process existence to detect a simulation segmentation fault or core
dump. While this can get complicated, the flexibility of Dakota's
interfaces allows for a wide variety of user-defined monitoring
approaches.

\section{Failure communication}\label{failure:communication}

Once a failure is detected, it must be communicated so that Dakota can
take the appropriate corrective action. The form of this communication
depends on the type of simulation interface in use.

In the system call and fork simulation interfaces, a detected
simulation failure is communicated to Dakota through the results file.
Instead of returning the standard results file data, the string
``\texttt{fail}'' should appear at the beginning of the results file.
Any data appearing after the fail string will be ignored. Also,
Dakota's detection of this string is case insensitive, so
``\texttt{FAIL}'', ``\texttt{Fail}'', etc., are equally valid.

In the direct simulation interface case, a detected simulation
failure is communicated to Dakota through the return code provided by
the user's \texttt{analysis\_driver},
\texttt{input\_filter}, or \texttt{output\_filter}. As shown in
Section~\ref{advint:direct:extension}, the prototype for simulations
linked within the direct interface includes an integer return code.
This code has the following meanings: zero (false) indicates that all
is normal and nonzero (true) indicates an exception (i.e., a
simulation failure).

\section{Failure mitigation}\label{failure:mitigation}

Once the analysis failure has been communicated, Dakota will attempt
to recover from the failure using one of the following four
mechanisms, as governed by the interface specification in the user's
input file (see the Interface Commands chapter in the Dakota Reference
Manual~\cite{RefMan} for additional information).

\subsection{Abort (default)}\label{failure:mitigation:abort}

If the \texttt{abort} option is active (the default), then Dakota will
terminate upon detecting a failure. Note that if the problem causing
the failure can be corrected, Dakota's restart capability (see
Chapter~\ref{restart}) can be used to continue the study.

\subsection{Retry}\label{failure:mitigation:retry}

If the \texttt{retry} option is specified, then Dakota will re-invoke
the failed simulation up to the specified number of retries. If the
simulation continues to fail on each of these retries, Dakota will
terminate. The retry option is appropriate for those cases in which
simulation failures may be resulting from transient computing
environment issues, such as shared disk space, software license
access, or networking problems.

\subsection{Recover}\label{failure:mitigation:recover}

If the \texttt{recover} option is specified, then Dakota will not
attempt the failed simulation again. Rather, it will return a ``dummy''
set of function values as the results of the function evaluation. The
dummy function values to be returned are specified by the user. Any
gradient or Hessian data requested in the active set vector will be
zero. This option is appropriate for those cases in which a failed
simulation may indicate a region of the design space to be avoided and
the dummy values can be used to return a large objective function or 
constraint violation which will discourage an optimizer from further
investigating the region.

\subsection{Continuation}\label{failure:mitigation:continuation}

If the \texttt{continuation} option is specified, then Dakota will
attempt to step towards the failing ``target'' simulation from a nearby
``source'' simulation through the use of a continuation algorithm. This
option is appropriate for those cases in which a failed simulation may
be caused by an inadequate initial guess. If the ``distance'' between
the source and target can be divided into smaller steps in which
information from one step provides an adequate initial guess for the
next step, then the continuation method can step towards the target in
increments sufficiently small to allow for convergence of the
simulations.

When the failure occurs, the interval between the last successful
evaluation (the source point) and the current target point is halved
and the evaluation is retried. This halving is repeated until a
successful evaluation occurs. The algorithm then marches towards the
target point using the last interval as a step size. If a failure
occurs while marching forward, the interval will be halved again. Each
invocation of the continuation algorithm is allowed a total of ten
failures (ten halvings result in up to 1024 evaluations from source to
target) prior to aborting the Dakota process.

While Dakota manages the interval halving and function evaluation
invocations, the user is responsible for managing the initial guess
for the simulation program. For example, in a GOMA input
file~\cite{Sch95}, the user specifies the files to be used for reading
initial guess data and writing solution data. When using the last
successful evaluation in the continuation algorithm, the translation
of initial guess data can be accomplished by simply copying the
solution data file leftover from the last evaluation to the initial
guess file for the current evaluation (and in fact this is useful for
all evaluations, not just continuation). However, a more general
approach would use the \emph{closest} successful evaluation (rather
than the \emph{last} successful evaluation) as the source point in the
continuation algorithm. This will be especially important for nonlocal
methods (e.g., genetic algorithms) in which the last successful
evaluation may not necessarily be in the vicinity of the current
evaluation. This approach will require the user to save and manipulate
previous solutions (likely tagged with evaluation number) so that the
results from a particular simulation (specified by Dakota after
internal identification of the closest point) can be used as the
current simulation's initial guess.  This more general approach is not
yet supported in Dakota.

\section{Special values} \label{failure:special}

In IEEE arithmetic, ``NaN'' indicates ``not a number'' and
$\pm$``Inf'' or $\pm$``Infinity" indicates positive or negative
infinity.  These special values may be returned directly in function
evaluation results from a simulation interface or they may be
specified in a user's input file within the \texttt{recover}
specification described in Section~\ref{failure:mitigation:recover}.
There is a key difference between these two cases.  In the former case
of direct simulation return, failure mitigation can be managed on a
per response function basis.  When using \texttt{recover}, however,
the failure applies to the complete set of simulation results.

In both of these cases, the handling of NaN or Inf is managed using
iterator-specific approaches.  Currently, nondeterministic sampling
methods (see Section~\ref{uq:sampling}), polynomial chaos expansions
using either regression approaches or spectral projection with random
sampling (see Section~\ref{uq:expansion}), and the NL2SOL method for
nonlinear least squares (see \S\ref{nls:solution:nl2sol}) are the only
methods with special numerical exception handling: the sampling
methods simply omit any samples that are not finite from the
statistics generation, the polynomial chaos methods omit any samples
that are not finite from the coefficient estimation, and NL2SOL treats
NaN or Infinity in a residual vector (i.e., values in a results file
for a function evaluation) computed for a trial step as an indication
that the trial step was too long and violates an unstated constraint;
NL2SOL responds by trying a shorter step.
